Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hg19 and hg38 analysis from same workflow branch #8

Open
wants to merge 53 commits into
base: master
Choose a base branch
from
Open

Conversation

NagaComBio
Copy link
Member

Merging hg38 developments into the main (master) branch.

Main updates

  • Hg38 related files are in a separate analysis XML
  • Hg19 changes
    • EVS and ExAC are removed from the annotation in hg19 since they were not used for no-control filtering any changes in the high confidence somatic variants. But this changes the column length in the output file.
    • Default local control AF threshold to 0.05 from 0.01, this might increase the number of final high confidence variants.
    • Used user-defined AF thresholds for the 'SNP_germline_support' annotation instead of hard-coded thresholds, this might change the last counts.

Integrated testing

  • Performed different tests and the results are in Phabricator {T5184#87479}

@NagaComBio NagaComBio requested review from GWarsow and vinjana and removed request for GWarsow March 16, 2021 10:34
Conflicts:
	README.md
	resources/analysisTools/snvPipeline/environments/tbi-lsf-cluster.sh
	resources/analysisTools/snvPipeline/filter_PEoverlap.py
	resources/configurationFiles/analysisSNVCalling.xml
@@ -251,6 +251,10 @@ def parseVcf(file,num):
while (l!= ""):
t=l.split('\t')
if (t[0][0] != "#") and isValid(t):
# Skipping the non-primary assembly variants from purity calculations
if t[0].startswith('HLA') or t[0].endswith('_alt'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t -> fields
l -> line

BTW (2 lines up): line[0] != "#" is much clearer than fields[0][0] != "#" (not to speak of t[0][0]).

# Skipping the non-primary assembly variants from purity calculations
if t[0].startswith('HLA') or t[0].endswith('_alt'):
l=vcf.readline()
continue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is i (next line)? Maybe mapped_chromosome?

@@ -199,6 +199,10 @@ def calculateErrorMatrix(vcfFilename, referenceFilename, errorType):
# 23.05.2016 JB: Excluded multiallelic SNVs
if ',' in split_line[header.index("ALT")]: continue

# 21.02.2023 NP: Excluded SNVs with 'N' before or after "," in context
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a short word about why you exclude these? I guess it is not obvious because it was added to the SNV workflow only after years of operation.

@@ -1,3 +1,483 @@
<<<<<<< HEAD
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge error.

resources/analysisTools/snvPipeline/filter_PEoverlap.py Outdated Show resolved Hide resolved
resources/analysisTools/snvPipeline/filter_PEoverlap.py Outdated Show resolved Hide resolved
Comment on lines +131 to +132
# Reference file for CRAM files
reference_file = args.refFileName
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICS reference_file is only used once, in the CRAM branch of the if-else block below. Moving it down will make it clearer

Suggested change
# Reference file for CRAM files
reference_file = args.refFileName

samfile = pysam.Samfile(args.alignmentFile, mode)
elif args.alignmentFile.split(".")[-1] == "cram":
mode += "c"
samfile = pysam.Samfile(args.alignmentFile, mode, reference_filename = reference_file)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
samfile = pysam.Samfile(args.alignmentFile, mode, reference_filename = reference_file)
# CRAM needs a reference file.
samfile = pysam.Samfile(args.alignmentFile, mode, reference_filename = args.refFileName)

<cvalue name='CHROMOSOME_LENGTH_FILE' value='${hg38BaseDirectory}/stats/GRCh38_decoy_ebv_alt_hla_phiX.fa.chrLength.tsv' type="path" />

<cvalue name="CHROMOSOME_INDICES" value="( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y HLA ALT )" type="bashArray" description="Chr indices for the calling"/>
<cvalue name="CHROMOSOME_HLA_CONTIGS_FILE" value='${hg38BaseDirectory}/stats/GRCh38_decoy_ebv_alt_hla_phiX.fa.HLA_contigs.bed' type="path" description="HLA contig list"/>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a description field with a short description of the file content and format? If I saw that correctly, it should be suited for samtools mpileup -R $file, right? That would already be a valuable information. Same for the ALT file.

$ccoord = $ctrl[1];
if ($tcoord == $ccoord) # matching pair found!
if ($tcoord == $ccoord && $current_t_chr eq $current_c_chr) # matching pair found!
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I see this correctly, the algorithm takes the two position-sorted VCFs and steps forward the feature -- either tumor or control VCF -- that is expected "earlier" in the algorithm, until it finds a match. Right?

I think, here and elsewhere, it would make more sense and would make the code clearer, if the chromosome comparison came before the coordinate comparison, i.e.

Suggested change
if ($tcoord == $ccoord && $current_t_chr eq $current_c_chr) # matching pair found!
if ($current_t_chr eq $current_c_chr && $tcoord == $ccoord) # matching pair found!

@@ -48,6 +48,7 @@ dat$chromosome <- factor(dat$chromosome, levels = paste0("chr",c(seq(1,22),"X","


chromLength = read.table(file = opt$chromLengthFile, header = F)
chromLength$V1 = gsub("chr", "", chromLength$V1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's good practice to define a header during loading of the table. At least it would add a bit of information about the structure of the table.

}
else{
$all++;
@help = split ("\t", $line);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could also rename this to @line. I think, it also has advantages if two variables that differ only by the type, but not by the content (in principle) are called the same. Some for @head above.

<cvalue name="CHROM_SIZES_FILE" value="${hg38BaseDirectory}/stats/GRCh38_decoy_ebv_alt_hla_phiX.fa.chrLenOnlyACGT_realChromosomes.tsv" type="path" />
<cvalue name='CHROMOSOME_LENGTH_FILE' value='${hg38BaseDirectory}/stats/GRCh38_decoy_ebv_alt_hla_phiX.fa.chrLength.tsv' type="path" />

<cvalue name="CHROMOSOME_INDICES" value="( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y HLA ALT )" type="bashArray" description="Chr indices for the calling"/>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should add CHROMOSOME_INDICES_PLOTTING with a description.

* 3.0.0

* Major
* Support for hg38/GRCh38 reference genome and variant calling from ALT and HLA contigs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any new mandatory configuration values? At least list them. Details should be in the XML.
Was the semantics of conf. values changed?

* Major
* Support for hg38/GRCh38 reference genome and variant calling from ALT and HLA contigs.
* Minor
* For hg38: Removing mappability and repeat elements' annotations from penalty calculations.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any new optional configuration values? (at least list them)

* skipREMAP: Option to remove repeat elements and mappability from confidence annotations in hg19.
* Removing EVS And ExAC AF from the annotations and no-control workflow filtering
* Support for variant calling from CRAM files
* Bug fix: Removing "quote" around the raw filter option `<RAW_SNV_FILTER_OPTIONS>`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. But this is a technical/code description. What was the effect of the bug to the user? (probably that multiple RAW_SNV_FILTER_OPTIONS could not be used).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you try limiting the (heap) memory consumption of this Python process. See here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is passing over the VCF once. Not sure how much memory it uses, but maybe it is worthwhile limiting the memory with this approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants